
    Provable Deterministic Leverage Score Sampling

    We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the largest leverage scores results in a good low-rank matrix surrogate". To obtain provable guarantees, previous work requires randomized sampling of the columns with probabilities proportional to their leverage scores. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such deterministic sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which often matches or outperforms state-of-the-art techniques. Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
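    A minimal numpy sketch of the deterministic selection rule described above, under the standard convention that the rank-k leverage score of a column is its squared loading on the top-k right singular subspace; the function name and parameter choices are illustrative assumptions, not the paper's code.

```python
import numpy as np

def deterministic_leverage_sampling(A, k, c):
    """Keep the c columns of A with the largest rank-k leverage scores.

    The score of column j is the squared norm of the j-th column of
    V_k^T, i.e. how strongly that column loads on the top-k right
    singular subspace of A.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0)        # one leverage score per column
    top = np.sort(np.argsort(scores)[::-1][:c])    # indices of the c largest scores
    return A[:, top], top

# Toy usage: a noisy low-rank matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 50))
C, cols = deterministic_leverage_sampling(A, k=10, c=15)
```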

    Optimal CUR Matrix Decompositions

    The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A$, together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A$, as well as a $c \times r$ low-rank matrix $U$ such that the matrix $CUR$ approximates the matrix $A$, that is, $\|A - CUR\|_F^2 \le (1+\epsilon) \|A - A_k\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm and $A_k$ is the best $m \times n$ matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where $c = O(k/\epsilon)$, $r = O(k/\epsilon)$, and $\mathrm{rank}(U) = k$. Up to constant factors, our algorithms are simultaneously optimal in $c$, $r$, and $\mathrm{rank}(U)$. Comment: small revision in lemma 4.
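    As a reference point for what a CUR factorization looks like, the numpy sketch below builds $C$, $R$, and the Frobenius-optimal middle factor $U = C^{+} A R^{+}$ for an already-chosen set of column and row indices. This is only an illustrative construction, not the paper's input-sparsity-time or deterministic algorithms, which also select the indices and control $\mathrm{rank}(U)$.

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Build a CUR approximation from chosen column/row indices.

    U = pinv(C) @ A @ pinv(R) minimizes ||A - C U R||_F for the given
    C and R (illustrative only; it does not enforce rank(U) = k).
    """
    C = A[:, col_idx]                                  # m x c
    R = A[row_idx, :]                                  # r x n
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # c x r
    return C, U, R

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 8)) @ rng.standard_normal((8, 40))
C, U, R = cur_from_indices(A, col_idx=np.arange(12), row_idx=np.arange(12))
rel_err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
```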

    Block CUR: Decomposing Matrices using Groups of Columns

    A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. The algorithm has applications in problems as diverse as biometric data analysis and distributed computing. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster. There, blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than individual columns stored across different nodes. In the biometric setting, rows correspond to different users and columns to the users' biometric reactions to external stimuli, e.g., watching video content, at a particular time instant. There is significant cost in acquiring each user's reaction to lengthy content, so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation because of the lack of context that caused the biometric reaction; instead, collections of time segments (i.e., blocks) must be presented to the user. The practical applicability of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment. Comment: shorter version to appear in ECML-PKDD 201
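    A small numpy sketch of block column sampling in this spirit, where whole blocks are scored and drawn together. Scoring a block by the sum of its columns' rank-k leverage scores, and all parameter names, are assumptions for illustration; the paper defines its own block sampling scores and guarantees.

```python
import numpy as np

def sample_column_blocks(A, block_size, n_blocks, k, rng):
    """Draw n_blocks column blocks, each block kept or discarded as a unit.

    Blocks are sampled with probability proportional to the sum of the
    rank-k leverage scores of their columns (an illustrative proxy).
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    col_scores = np.sum(Vt[:k, :] ** 2, axis=0)
    blocks = np.array_split(np.arange(A.shape[1]), A.shape[1] // block_size)
    block_scores = np.array([col_scores[b].sum() for b in blocks])
    probs = block_scores / block_scores.sum()
    chosen = rng.choice(len(blocks), size=n_blocks, replace=False, p=probs)
    cols = np.concatenate([blocks[i] for i in np.sort(chosen)])
    return A[:, cols], cols

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 6)) @ rng.standard_normal((6, 60))
C, cols = sample_column_blocks(A, block_size=10, n_blocks=3, k=6, rng=rng)
```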

    Approximations of Schatten Norms via Taylor Expansions

    In this paper we consider a symmetric, positive semidefinite (SPSD) matrix $A$ and present two algorithms for computing the $p$-Schatten norm $\|A\|_p$. The first algorithm works for any SPSD matrix $A$. The second algorithm works for non-singular SPSD matrices and runs in time that depends on $\kappa = \lambda_1(A)/\lambda_n(A)$, where $\lambda_i(A)$ is the $i$-th eigenvalue of $A$. Our methods are simple and easy to implement and can be extended to general matrices. Our algorithms improve, for a range of parameters, recent results of Musco, Netrapalli, Sidford, Ubaru and Woodruff (ITCS 2018) and match the running time of the methods by Han, Malioutov, Avron, and Shin (SISC 2017) while avoiding computation of the coefficients of Chebyshev polynomials.
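    For reference, the quantity being approximated can be computed exactly from the eigenvalues; below is a short numpy sketch of that brute-force baseline (not the paper's Taylor-expansion algorithms), with illustrative names.

```python
import numpy as np

def schatten_norm_spsd(A, p):
    """Exact p-Schatten norm of an SPSD matrix: (sum_i lambda_i^p)^(1/p).

    This is the O(n^3) eigendecomposition baseline that the faster
    approximation algorithms aim to avoid.
    """
    eigvals = np.linalg.eigvalsh(A)
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return np.sum(eigvals ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
B = rng.standard_normal((200, 200))
A = B @ B.T                                 # SPSD test matrix
print(schatten_norm_spsd(A, p=3))
```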

    Spectral Clustering: An Empirical Study of Approximation Algorithms and its Application to the Attrition Problem

    Clustering is the problem of separating a set of objects into groups (called clusters) so that objects within the same cluster are more similar to each other than to those in different clusters. Spectral clustering is a now well-known clustering method which uses the spectrum of the data similarity matrix to perform this separation. Since the method relies on solving an eigenvector problem, it is computationally expensive for large datasets. To overcome this constraint, approximation methods have been developed which aim to reduce running time while maintaining accurate classification. In this article, we summarize and experimentally evaluate several approximation methods for spectral clustering. From an applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one aims to separate the employees who are likely to voluntarily leave the company from those who are not. Our study sheds light on the empirical performance of existing approximate spectral clustering methods and shows their applicability to an important business-optimization problem.
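    For context, here is a compact numpy/scipy sketch of the textbook normalized spectral clustering pipeline that such approximation methods speed up; it is the baseline, not any of the approximation schemes evaluated in the article, and the kernel similarity used in the toy example is an assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, k):
    """Baseline normalized spectral clustering on a similarity matrix S.

    Embed the points with the k smallest eigenvectors of the normalized
    Laplacian, row-normalize, then run k-means on the embedding.
    """
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(S.shape[0]) - D_inv_sqrt @ S @ D_inv_sqrt   # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    Y = eigvecs[:, :k]
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    _, labels = kmeans2(Y, k, minit='++')
    return labels

# Toy usage: Gaussian-kernel similarity of two well-separated point clouds.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
S = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
labels = spectral_clustering(S, k=2)
```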

    Optimal Principal Component Analysis in Distributed and Streaming Models

    We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in R^{m \times n}$, a rank parameter $k < \mathrm{rank}(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which $\|A - UU^T A\|_F^2 \le (1+\epsilon) \|A - A_k\|_F^2$, where $A_k \in R^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA. Comment: STOC 2016 full version.
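    A brief numpy sketch of the objective above: for the exact top-k left singular vectors, the PCA cost $\|A - UU^T A\|_F^2$ equals the optimal $\|A - A_k\|_F^2$, which is the quantity the distributed and streaming algorithms approximate within a $(1+\epsilon)$ factor. The helper name is illustrative.

```python
import numpy as np

def pca_cost(A, U):
    """Frobenius PCA cost ||A - U U^T A||_F^2 for an orthonormal m x k matrix U."""
    return np.linalg.norm(A - U @ (U.T @ A), 'fro') ** 2

rng = np.random.default_rng(5)
A = rng.standard_normal((200, 120))
k = 10

# Exact solution: the top-k left singular vectors achieve the optimum,
# ||A - A_k||_F^2 = sum of the squared singular values beyond the k-th.
U, s, _ = np.linalg.svd(A, full_matrices=False)
optimum = np.sum(s[k:] ** 2)
assert np.isclose(pca_cost(A, U[:, :k]), optimum)
```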

    Do Parents Recognize Autistic Deviant Behavior Long before Diagnosis? Taking into Account Interaction Using Computational Methods

    BACKGROUND: To assess whether taking into account interaction synchrony would help to better differentiate autism (AD) from intellectual disability (ID) and typical development (TD) in family home movies of infants aged less than 18 months, we used computational methods. METHODOLOGY AND PRINCIPAL FINDINGS: First, we analyzed interactive sequences extracted from home movies of children with AD (N = 15), ID (N = 12), or TD (N = 15) through the Infant and Caregiver Behavior Scale (ICBS). Second, discrete behaviors between baby (BB) and caregiver (CG) co-occurring in less than 3 seconds were selected as single interactive patterns (or dyadic events) for analysis of the two directions of interaction (CG→BB and BB→CG) by group and semester. To do so, we used a Markov assumption, a Generalized Linear Mixed Model, and non-negative matrix factorization. Compared to TD children, BBs with AD exhibit a growing deviant development of interactive patterns, whereas those with ID rather show an initial developmental delay. Parents of children with AD or ID do not differ much from parents of TD children when responding to their child. However, when initiating interaction, parents use more touching and regulation-up behaviors as early as the first semester. CONCLUSION: When studying interactive patterns, deviant autistic behaviors appear before 18 months. Parents seem to feel the lack of interactive initiative and responsiveness of their babies and try to increasingly supply soliciting behaviors. Thus we stress that credence should be given to parents' intuition, as they recognize, long before diagnosis, the pathological process through the interactive patterns with their child.

    The projection score - an evaluation criterion for variable subset selection in PCA visualization

    Background: In many scientific domains, it is becoming increasingly common to collect high-dimensional data sets, often with an exploratory aim, to generate new and relevant hypotheses. The exploratory perspective often makes statistically guided visualization methods, such as Principal Component Analysis (PCA), the methods of choice. However, the clarity of the obtained visualizations, and thereby the potential to use them to formulate relevant hypotheses, may be confounded by the presence of the many non-informative variables. For microarray data, more easily interpretable visualizations are often obtained by filtering the variable set, for example by removing the variables with the smallest variances or by only including the variables most highly related to a specific response. The resulting visualization may depend heavily on the inclusion criterion, that is, effectively the number of retained variables. To our knowledge, there exists no objective method for determining the optimal inclusion criterion in the context of visualization.
    Results: We present the projection score, which is a straightforward, intuitively appealing measure of the informativeness of a variable subset with respect to PCA visualization. This measure can be universally applied to find suitable inclusion criteria for any type of variable filtering. We apply the presented measure to find optimal variable subsets for different filtering methods in both microarray data sets and synthetic data sets. We note also that the projection score can be applied in general contexts, to compare the informativeness of any variable subsets with respect to visualization by PCA.
    Conclusions: We conclude that the projection score provides an easily interpretable and universally applicable measure of the informativeness of a variable subset with respect to visualization by PCA, that can be used to systematically find the most interpretable PCA visualization in practical exploratory analysis.
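    The variance-based filtering mentioned above is easy to make concrete; the numpy sketch below keeps the highest-variance variables and computes the 2-D PCA coordinates one would visualize. It does not reproduce the projection score itself, whose definition is not given in this abstract; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def variance_filter_pca(X, n_keep, n_components=2):
    """Keep the n_keep highest-variance variables of X (samples x variables),
    then return the PCA sample coordinates used for visualization.

    Illustrates the filtering step whose inclusion criterion (n_keep)
    the projection score is designed to choose.
    """
    Xc = X - X.mean(axis=0)
    keep = np.sort(np.argsort(Xc.var(axis=0))[::-1][:n_keep])
    Xf = Xc[:, keep]
    _, _, Vt = np.linalg.svd(Xf, full_matrices=False)
    coords = Xf @ Vt[:n_components].T       # scores on the leading PCs
    return coords, keep

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 500))          # e.g. 40 samples, 500 variables
coords, kept = variance_filter_pca(X, n_keep=100)
```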

    Feature-by-Feature – Evaluating De Novo Sequence Assembly

    The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and the number of contigs focus only on size without proportionately emphasizing information about the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-off between contigs' quality and their sizes. Nevertheless, the relationships among the different features and their relative importance remain unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis, we were able to estimate the “excess dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describes the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features, we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to unrealistic results.
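    A short numpy sketch of the kind of analysis described: run PCA on a standardized (assemblies x features) matrix and inspect the explained-variance ratios to gauge how many effective dimensions the feature set really has. The matrix shape and the toy data are assumptions; this is not the paper's exact pipeline.

```python
import numpy as np

def explained_variance_ratio(F):
    """Explained-variance ratios of the principal components of an
    (assemblies x features) matrix F, after standardizing each feature.

    A steep drop-off suggests the feature space has few effective
    dimensions, i.e. the remaining ones are "excess dimensionality".
    """
    Z = (F - F.mean(axis=0)) / F.std(axis=0)
    _, s, _ = np.linalg.svd(Z, full_matrices=False)
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(7)
F = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 12))  # 12 correlated features
print(explained_variance_ratio(F))
```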